add moriio transfer engine#1742
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 30142a6a0a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| [[src_offset]], | ||
| [remote_mem], | ||
| [[pull.dst_offset]], | ||
| [[pull.length]], |
There was a problem hiding this comment.
Use source buffer size for RDMA write length
The sender already reads src_size from its own buffer table, but the transfer length is taken from pull.length (receiver-provided metadata). If that metadata is stale or malformed, the sender can truncate a valid payload (when pull.length < src_size) or attempt to read beyond the stored buffer (when pull.length > src_size) while still treating the operation as a normal transfer path. The write length should be validated against src_size (or derived solely from src_size) before calling batch_write.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ed7bde1f46
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if not bound: | ||
| try: | ||
| sock.bind(f"tcp://{self.host}:*") | ||
| bound = True | ||
| except zmq.ZMQError as exc: |
There was a problem hiding this comment.
Fail fast instead of rebinding sender ZMQ to random port
When bind on self.zmq_port fails with EADDRINUSE, this code silently rebinds to an OS-assigned port, but the metadata-less receive path still resolves the sender using configured sender_host/sender_zmq_port (see get() -> _query_metadata_from_sender(), and call sites like kv_transfer_manager.receive_kv_cache_for_request and chunk_transfer_adapter._poll_single_request that call connector.get(..., metadata=None)). In that context, receivers keep querying the old configured port and time out indefinitely even though the sender is running, so this should fail fast or explicitly propagate the new port to receivers.
Useful? React with 👍 / 👎.
| stale = [k for k, v in self._local_buffers.items() if now - v[5] > _BUFFER_TTL_SECONDS] | ||
| for k in stale: |
There was a problem hiding this comment.
Exclude in-flight buffers from TTL reclamation
TTL cleanup reclaims entries purely by age, but _handle_pull_request() only reads _local_buffers under lock and then performs batch_write(...); st.Wait() without marking that entry in-flight; if a receiver pulls a buffer near/after the 300s TTL, the listener thread can reclaim and release that allocator region concurrently while the RDMA write is still pending. This creates a race that can corrupt transfers or cause intermittent failures under delayed pulls, so stale-buffer GC needs an in-flight guard (or timestamp refresh) before transfer starts.
Useful? React with 👍 / 👎.
|
thanks @inkcherry for the PR. |
|
@inkcherry is MoRII beneficial for intranode communication? will you enable it for intranode communication? Because vLLM-Omni also has many use cases on single node deployment. |
Hi, @tjtanaa, I've added intra-node benchmark results. Mori delivers correct outputs with competitive performance. |
|
Hi @inkcherry, thanks for the PR. We tried to test with
There's no mention of Would you be able to share the reproduction steps you used to verify this? |
|
hi, @junkang1991 , thanks for the try, during the initialization phase, I can see such a log. The log shows that XGMI is automatically selected as the mori backend. I have added logs wrapping If you still encounter issues, could you try updating Mori to a newer commit? |
…node XGMI Signed-off-by: junkang1991 <junkangchow@gmail.com>
|
@inkcherry @junkang1991 please fix the readthedocs and pre-commit error. |
|
@tjtanaa , thanks for the reminder, fixd, cc @junkang1991 |
|
This refactor #1908 remove multi-node deployment on vllm-omni codebase. Current single node deployment still work. I think we can move forward with the PR at this stage, I've removed the inter-node configuration script for now. Once the vllm-omni codebase supports inter-node deployment, we will add it back after re-testing. |
Signed-off-by: inkcherry <mingzhi.liu@amd.com>
|
|
||
| ## When to Use | ||
|
|
||
| Best for high-performance multi-node data transfer using Mori RDMA transfer engine. |
There was a problem hiding this comment.
can you update the description for now. Stating this reason #1742 (comment) . And mention that intra-node is currently supported.
Signed-off-by: inkcherry <mingzhi.liu@amd.com>
Signed-off-by: inkcherry <mingzhi.liu@amd.com>
Clarifies how to opt this deploy yaml into true AMD Infinity Fabric XGMI via the HIP transport landed in Mooncake PRs vllm-project#1742 and vllm-project#1550, and pins the default value to ``protocol: "rdma"`` — the path that is actually end-to-end validated on this branch (three consecutive text completions at 18.7 / 16.8 / 16.4 s with 93 per-chunk RDMA GETs). XGMI opt-in requires three things in lock step and is therefore not the default: 1. A mooncake wheel rebuilt with ``-DUSE_HIP=ON`` and reinstalled into the container's Python env (stock wheels don't ship HIP transport). 2. ``memory_pool_device: "cuda"`` so Mooncake picks the HIP allocator. 3. ``MC_FORCE_MNNVL=1`` in the launch environment so ``TransferEngineImpl::init()`` installs the HIP transport instead of defaulting back to RDMA whenever the node has HCAs (the current auto-topology path treats RDMA as the preferred fabric, even when HIP is compiled). Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn> Made-with: Cursor
remove changes on public files
Pure cosmetic reformatting requested by ruff-format: - mooncake_transfer_engine_connector.py: collapse a 2-line f-string into one line. - mori_transfer_engine_connector.py: same. - utils/initialization.py: drop redundant indentation in the `_ROLE_BOUND_ZMQ_CONNECTORS` frozenset literal and collapse two list comprehensions in `get_connectors_config_for_stage` to a single line. Verified locally with `pre-commit run --all-files` (all hooks pass). Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
- P1 length: validate pull.length against src_size before batch_write (prevents silent truncation / cross-allocation OOB read). - P2 port: drop EADDRINUSE silent fallback; match Mooncake's fail-fast. - P2 TTL: TODO mirroring wzliu's note; same race deferred to follow-up. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Hi @gcanlin, we plan to share this PR this Wednesday. Does this meeting require any registration before the sharing session? |
|
|
||
|
|
||
| class ManagedBuffer: | ||
| """Zero-copy view into the global memory pool. |
There was a problem hiding this comment.
These classes (ManagedBuffer, BufferAllocator) with the same functionality as Mooncake TE connector have been moved to omni_connectors/utils/memory_pool.py. It's suggested to use those classes.
There was a problem hiding this comment.
Thanks for catching this! Done in afbf08e, Mori now import BufferAllocator / ManagedBuffer from omni_connectors/utils/memory_pool.
| except Exception as e: | ||
| logger.warning(f"Failed to terminate ZMQ context: {e}") | ||
|
|
||
| self.pool = None # type: ignore[assignment] |
There was a problem hiding this comment.
Can Mori unregister pin memory and shutdown engine for clean up?
There was a problem hiding this comment.
Thanks for the suggestion! Done in b6daf0a — close() now calls deregister_memory + deregister_remote_engine for each registered peer, then drops the IOEngine ref so GC runs the C++ destructor (mori 1.1.2 doesn't expose an explicit shutdown()).
|
Dual mode / chunk-transfer endpoints are not rank-aware: _inject_chunk_path_endpoints() derives ports as base_port + stage_id, with no TP rank term. If any stage runs with tensor_parallel_size > 1, same-stage workers will bind the same ZMQ port, and heterogeneous TP cannot route per-rank shards because the connector only has a single upstream sender_host/sender_zmq_port. I suggest to fail fast for TP>1 on this path or enforce that dual chunk transfer is TP=1-only. |
Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Mirrors the Mooncake teardown flow: explicit ``deregister_memory`` for
the pinned pool plus ``deregister_remote_engine`` for every remote
registered via ``_ensure_remote_registered``, then drop the only
Python reference so the IOEngine C++ destructor unwinds backend and
RDMA fabric state under Python GC. ``mori.io.IOEngine`` does not
expose an explicit ``shutdown()``/``close()`` entry-point (confirmed
on mori 1.1.2.dev30, ``dir(mori.io.IOEngine)`` has no
shutdown/close/destroy/stop/cleanup/terminate), so the
drop-ref-and-GC approach is the supported teardown path.
``deregister_remote_engine`` is typed
``(self, engine_desc: libmori_pybinds.EngineDesc)`` in the pybind
binding, not a key string, so ``_registered_engines`` now stores
``{key -> EngineDesc}`` instead of ``set[str]`` and ``close()``
passes the actual desc back. Without this every teardown would
hit ``TypeError: incompatible function arguments`` that the
surrounding ``try/except`` swallowed silently and the remote
engines would never be deregistered.
Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Conflicts resolved:
- vllm_omni/distributed/omni_connectors/utils/initialization.py:
keep both new symbols; main's ``KV_REPLICA_PORT_STRIDE`` (semantic
sibling of ``KV_RANK_PORT_STRIDE``) placed immediately after it,
followed by the chunk-path-side additions
(``_ROLE_BOUND_ZMQ_CONNECTORS`` set and ``_detect_local_ip``).
- vllm_omni/distributed/omni_connectors/factory.py:
keep both new ``register_connector`` calls
(``MoriTransferEngineConnector`` and
``YuanrongTransferEngineConnector``), grouped after
``YuanrongConnector``.
- vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py:
take main side and drop the 4-line "imported from utils/memory_pool"
explanatory comment, matching what main settled on (and aligning
with ``MoriTransferEngineConnector``, which doesn't carry that
comment either).
Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Thanks for the review! Done in ef596e7 , Mori's |
Per-edge support will land in a follow-up alongside the new OmniConnectorModelRunnerMixin (RFC vllm-project#1546 §2.5). Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
| # See ``MoriTransferEngineConnector`` for the rationale. | ||
| tp_world_size = get_tp_world_size() | ||
| if tp_world_size > 1: | ||
| raise NotImplementedError( |
There was a problem hiding this comment.
Here should also be reverted. Otherwise, models like Hunyuan Image, which uses TP, will fail.
There was a problem hiding this comment.
good catch, reverted in 8a92219
Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Replace the yaml link with an inline code path; mkdocs --strict rejects links to non-documentation files and broke the RTD build. Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>

Purpose
Add
MoriTransferEngineConnector— a new OmniConnector backend using Mori RDMA transfer engine for zero-copy data transfers between disaggregated pipeline stages. The implementation follows a similar architecture toMooncakeTransferEngineConnector, adapted to Mori'sIOEngine/MemoryDesc/EngineDescAPI.IOEngine.batch_write()with asyncTransferStatustrackingtorch.Tensor/bytesfast path bypassing serializationmoriis not installedChanges include the connector implementation, factory registration, module exports, design doc, and example stage config YAML.
Test Plan
Hardware: 3x AMD Instinct MI300X nodes (8 GPUs each), Mellanox ConnectX-7 400Gbps RoCE NIC (mlx5_0).
3-node disaggregated serving with Qwen2.5-Omni-7B, each stage on a separate node with TP=2:
Benchmark client:
Test Result
0→1 (Thinker → Talker)
1→2 (Talker → Code2Wav)
0→1 (Thinker → Talker)(intranode communication)
1→2 (Talker → Code2Wav)(intranode communication)
Mori shows lower in-flight latency compared to Mooncake, with comparable tx/rx performance. The correctness of the generated audio output was also manually verified by listening to the results. It provides a viable RDMA backend option for AMD devices.